Circular Sequence Comparison with q-grams
Abstract
Sequence comparison is a fundamental step in many important tasks in bioinformatics. Traditional algorithms for measuring approximation in sequence comparison are based on the notions of distance or similarity, and are generally computed through sequence alignment techniques. Circular genome structure is a common phenomenon in nature, but specialized alignment techniques for circular sequence comparison are computationally expensive, requiring from super-quadratic to cubic time in the length of the sequences. In this paper, we introduce a new distance measure based on q-grams, and show how it can be computed efficiently for circular sequence comparison. Experimental results, using real and synthetic data, demonstrate orders-of-magnitude superiority of our approach in terms of efficiency, while maintaining accuracy competitive with the state of the art.
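The abstract does not define the paper's distance measure, so the following is only a minimal sketch of the general idea, under stated assumptions: each sequence is read as a circular string, its q-grams (including those that wrap around the end) are counted, and the two count profiles are compared with an L1 distance. The function names `circular_qgram_profile` and `circular_qgram_distance` are hypothetical and are not taken from the paper.

```python
from collections import Counter


def circular_qgram_profile(s: str, q: int) -> Counter:
    """Count the q-grams of s when s is read as a circular string."""
    if q < 1 or q > len(s):
        raise ValueError("q must satisfy 1 <= q <= len(s)")
    # Appending the first q-1 characters exposes the q-grams that wrap around.
    extended = s + s[:q - 1]
    return Counter(extended[i:i + q] for i in range(len(s)))


def circular_qgram_distance(x: str, y: str, q: int) -> int:
    """L1 distance between circular q-gram profiles (an illustrative assumption,
    not necessarily the measure introduced in the paper)."""
    px = circular_qgram_profile(x, q)
    py = circular_qgram_profile(y, q)
    # Counter returns 0 for missing keys, so absent q-grams count in full.
    return sum(abs(px[g] - py[g]) for g in px.keys() | py.keys())


if __name__ == "__main__":
    # "GTAC" is a rotation of "ACGT", so the circular profiles coincide.
    print(circular_qgram_distance("ACGT", "GTAC", 2))
    print(circular_qgram_distance("ACGT", "ACGA", 2))
```

Because the profile is built from the circular reading, any rotation of a sequence yields the same profile. Each profile takes time linear in the sequence length to build, which is the kind of saving over alignment-based comparison that the abstract points to.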
Similar resources
Exact Circular Pattern Matchings Using Bit-Parallelism and q-Gram Technique
We present three efficient algorithms for exact circular string matching. One of the algorithms is for a single circular pattern and the others are for multiple circular patterns. Our algorithms apply q-grams and bit parallelism. The algorithms are named CBNDMq, CMultiBNDM and CMultiBNDMq, respectively. These two problems can also be solved by some proposed multiple patterns matching algori...
A faster and more accurate heuristic for cyclic edit distance computation
Sequence comparison is the core computation of many applications involving textual representations of data. Edit distance is the most widely used measure to quantify the similarity of two sequences. Edit distance can be defined as the minimal total cost of a sequence of edit operations to transform one sequence into the other; for a sequence x of length m and a sequence y of length n , it can b...
Q-gram Analysis and Urn Models
Words of fixed size q are commonly referred to as q-grams. We consider the problem of q-gram filtration, a method commonly used to speed up sequence comparison. We are interested in the statistics of the number of q-grams common to two random texts (where multiplicities are not counted) in the non uniform Bernoulli model. In the exact and dependent model, when omitting border effects, a q-gram ...
Indexing DNA Sequences Using q-Grams
We have observed in recent years a growing interest in similarity search on large collections of biological sequences. Contributing to the interest, this paper presents a method for indexing the DNA sequences efficiently based on q-grams to facilitate similarity search in a DNA database and sidestep the need for linear scan of the entire database. Two level index – hash table and c-trees – are ...
Similarity Joins of Text with Incomplete Information Formats
Similarity join over text is important in text retrieval and query. Due to the incomplete formats of information representation, such as abbreviation and short word, similarity joins should address an asymmetric feature that these incomplete formats may contain only partial information of their original representation. Current approaches, including cosine similarity with q-grams, can hardly dea...
Publication date: 2015